course: (ML Series) Support Vector Machine

By Julien Hernandez Lallement, 2020-11-05, in category Course

machine learning, python

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set_style('darkgrid')
import statsmodels.api as sm


def illustrate_svm(a=1.3, b=0.5, n=30, r=1, make_prediction=False, other_lines=False, main_split=False, margins=False):
    """
    params: 
    a: slope
    b: intercept
    n: sample size
    r: error
    make_prediction: boolean. Does not make a real prediction, simply adds a value for illustrative purposes
    random_lines: plots random regression lines
    plot_errors: plot the difference between example points and regression line, for illustrative purposes
    """
    np.random.seed(42)
    def f(x, r_low=0, r_high=1): return (a * x + b + r * np.random.uniform(r_low, r_high, len(x)))

    x = np.random.rand(n)
    y = f(x)
    fig, ax = plt.subplots()
    ax.set_xlim([-0.05, 1.15])    
    # plot regression result
    plt.scatter(x, y)
    if main_split:
        sns.regplot(x=[0.7, 0.5], y=[1, 2.25], ci=None, scatter=False)
        if margins:
            sns.regplot(x=[0.74, 0.54], y=[1, 2.25], ci=None, scatter=False, color='black')
            sns.regplot(x=[0.66, 0.46], y=[1, 2.25], ci=None, scatter=False, color='black')
    # plot alternative separating lines, for comparison with the main split
    if other_lines:
        sns.regplot(x=[0.8, 0.45], y=[1, 2.25], ci=None, scatter=False)
        sns.regplot(x=[0.65, 0.55], y=[1, 2.25], ci=None, scatter=False)
    # make prediction with regression result
    if make_prediction:
        x_ = np.array(max(x) + 0.1).reshape(1,1) 
        plt.plot(x_, f(x_), 'ro')
    plt.xlabel('x', fontsize=15)
    plt.ylabel('y', fontsize=15)
    plt.show()

Support Vector Machines

General Background

Support Vector Machines (SVM) are supervised learning classification methods. In a nutshell, this approach separates the data in feature space with respect to a hyperplane (a flat subspace of the multi-dimensional feature space).

Use case

SVM can be used mainly with the following goal in mind:

  • Classifying data based on numerical vectors

Theoretical Background

Say you have two numerical features that you can plot in a two-dimensional space.

In [3]:
illustrate_svm()

One way to separate two groups of data points would be to draw a line between the different categories. Such a line can be called a hyperplane. A hyperplane is a subspace of dimension N-1, N being the number of dimensions of the space in which it is located.

Let's draw one possible hyperplane

In [168]:
illustrate_svm(main_split=True)

The question is: which line would do the best job at separating the two groups of data points?

In [5]:
illustrate_svm(main_split=True, other_lines=True)

One way to approach this problem is to define the best line as the one with the largest margin between itself and the closest data points.

In [6]:
illustrate_svm(main_split=True, margins=True)

The goal of an SVM approach is to find the separating hyperplane whose margins are the furthest apart. You might wonder what the Support Vector part of SVM means. The support vectors are the data points from each category that are closest to the hyperplane; they support the hyperplane, and if they were to change, the placement of the hyperplane would change as well. The Machine part of SVM? Not sure, it seems it simply means that this algorithm allows machines to learn...

The margins are depicted in black in the plot above. This concept is called the max-margin hyperplane, which is the mid-line of the largest gap we can find between the data points. Using a pure max-margin hyperplane is relatively rare on real-world data and has a few downsides, one of the most important being its sensitivity to outliers (if one of the support vectors is an outlier of its category, then the algorithm would overfit the data).
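To make this a bit more formal, here is the standard hard-margin formulation (a textbook statement, not taken from this notebook): with labels $y_{i} \in \{-1, +1\}$, we look for the hyperplane $w^{T}x + b = 0$ that solves

$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_{i}(w^{T}x_{i} + b) \geq 1 \;\; \text{for all } i$$

The width of the margin is $2/\|w\|$, so minimizing $\|w\|$ is the same as maximizing the margin, and the constraints force every point onto the correct side of it.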

Typically, users will find a balance between a max-margin hyperplane and too many violations of that hyperplane, that is, data points that fall on the other (wrong) side of the hyperplane. That balance is controlled by a parameter typically referred to as C, which can be nicely tuned via cross-validation.

This section is taken from Tyler Folkman's slides: "A smaller value of C leads to more margin violations but a larger distance between the hyperplane and the support vectors. Inversely, a higher value has less margin violations, but a smaller distance between the hyperplane and support vectors. Reducing C leads to more bias, and increasing it more variance."
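In the standard soft-margin formulation (again a textbook form, not from the slides), the trade-off described in that quote appears as slack variables $\xi_{i}$ measuring how much each point violates the margin:

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_{i} \quad \text{subject to} \quad y_{i}(w^{T}x_{i} + b) \geq 1 - \xi_{i}, \quad \xi_{i} \geq 0$$

A large C makes violations expensive (smaller margin, more variance), while a small C tolerates them (larger margin, more bias), which matches the quote above.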

The Kernel Trick

It turns out that to solve the SVM model, the only computations required are inner products of the observations. The inner product of two rows of data is just the sum of their features multiplied pairwise. If we have P features:

$$\langle x_{i},x_{j}\rangle = \sum_{k=1}^{P}x_{ik}x_{jk}$$

The kernel trick basically says that every time we see an inner product when solving the SVM, we can replace it with a more general form. The form shown above is known as the linear form. But we also have a polynomial form:

$$K(x_{i},x_{j}) = (1 + \sum_{k=1}^{P}x_{ik}x_{jk})^d$$

In the polynomial form, we have a $d$ parameter which is the degree of polynomial we would like to use. It turns out that this is a much more efficient way of adding polynomial features to an SVM.
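As a quick sanity check (an extra sketch, not part of the original notebook), the cell below computes this polynomial kernel by hand on some random data and compares it with scikit-learn's polynomial_kernel, which computes $(\gamma \langle x_{i},x_{j}\rangle + coef0)^d$; setting $\gamma = 1$ and $coef0 = 1$ recovers the formula above.

In [ ]:
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

np.random.seed(42)
X = np.random.rand(5, 3)  # 5 observations, P = 3 features

def poly_kernel_manual(xi, xj, d=2):
    # (1 + sum_k x_ik * x_jk)^d, as in the formula above
    return (1 + np.dot(xi, xj)) ** d

K_manual = np.array([[poly_kernel_manual(xi, xj) for xj in X] for xi in X])
K_sklearn = polynomial_kernel(X, degree=2, gamma=1, coef0=1)
print(np.allclose(K_manual, K_sklearn))  # should print True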

Another popular form is the radial form:

$$K(x_{i},x_{j}) = \exp\left(-\gamma \sum_{k=1}^{P}(x_{ik} - x_{jk})^2\right)$$

In this form, we have the $\gamma$ parameter, which is > 0. If the two observations being compared are far apart, we take the exponential of a large negative number, which is a very small number. This means that observations far away have a small effect, so this kernel has very local behavior. This is an especially cool kernel because the enlarged feature space is infinite-dimensional, so without the kernel trick, we could never compute it explicitly!

Note: Choosing a large value for $\gamma$ increases variance and a small value increases bias. You can also try to find the best kernel and parameter values using cross-validation.
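Here too, a small sketch (not in the original notebook) that implements the radial form by hand and checks it against scikit-learn's rbf_kernel, which computes $\exp(-\gamma \|x_{i}-x_{j}\|^2)$:

In [ ]:
from sklearn.metrics.pairwise import rbf_kernel

np.random.seed(42)
X = np.random.rand(5, 3)
gamma = 2.0  # arbitrary illustrative value, must be > 0

def rbf_manual(xi, xj, gamma):
    # exp(-gamma * sum_k (x_ik - x_jk)^2), as in the formula above
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

K_manual = np.array([[rbf_manual(xi, xj, gamma) for xj in X] for xi in X])
K_sklearn = rbf_kernel(X, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))  # should print True

Note how the kernel entries shrink towards 0 for points that are far apart, which is the "local behavior" described above.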

Practical Demonstration

I will use the famous Iris dataset, which you can get directly from Sklearn.

In [8]:
from sklearn.datasets import load_iris
In [9]:
df=load_iris(as_frame=True)

In this data, you get three different categories, setosa, versicolor and virginica, with 50 observations each.
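If you want to verify the class balance yourself, a quick count (an extra cell, not in the original) does the job:

In [ ]:
# each of the three classes should appear 50 times
df.frame['target'].value_counts()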

In [103]:
# Name dict for legend mapping
target_name = {0: 'setosa',
               1: 'versicolor',
               2: 'virginica'}
In [104]:
# Plot data
sns.scatterplot(x=df.frame['petal length (cm)'], y=df.frame['sepal width (cm)'], hue=df.frame.target.map(target_name));

Sklearn can handle more than 2 classes with a one-vs-one method. It basically takes all the different pairs of labels and trains a separate classifier for each pair. At prediction time, all pairwise models are run and each one votes for a class; the class with the most votes wins (see the sketch below). Since this is quite computationally demanding, and I am doing this for demonstration purposes, I will focus on two feature dimensions.
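As a rough illustration of that one-vs-one strategy (an extra sketch, not in the original notebook), SVC exposes a decision_function_shape='ovo' option; with 3 classes there are 3 * (3 - 1) / 2 = 3 pairwise classifiers, hence 3 columns in the decision function:

In [ ]:
from sklearn.svm import SVC

# illustrative only: fit on the full 4-feature data with all 3 classes
ovo_model = SVC(kernel='linear', decision_function_shape='ovo')
ovo_model.fit(df.data, df.target)
# one column per pair of classes
print(ovo_model.decision_function(df.data).shape)  # should be (150, 3)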

So I take the first two columns:

In [105]:
df.data.iloc[:, 0:2]
Out[105]:
sepal length (cm) sepal width (cm)
0 5.1 3.5
1 4.9 3.0
2 4.7 3.2
3 4.6 3.1
4 5.0 3.6
... ... ...
145 6.7 3.0
146 6.3 2.5
147 6.5 3.0
148 6.2 3.4
149 5.9 3.0

150 rows Ɨ 2 columns

Let's first test the model

In [106]:
from sklearn.model_selection import train_test_split
In [107]:
X_train, X_test, y_train, y_test = train_test_split(df.data.iloc[:, :2], df.target)
In [108]:
from sklearn.svm import SVC
In [160]:
model = SVC(kernel='linear', C=100)

Note that you can use different types of kernels. In this case, I will be using a basic linear SVC, but I recommend you look into other kernels. In particular, the polynomial kernel (used in conjunction with a degree parameter) can be quite useful in some cases.

Note as well the regularization parameter C, here equal to 100. As discussed above, this C parameter influences the SVM optimization by controlling how much misclassification is accepted on the training examples. In other words, when C is high, the algorithm will choose a smaller-margin hyperplane if that gets more training points classified correctly. Conversely, when C has a low value, the optimizer will look for a larger-margin separating hyperplane, accepting more misclassifications.
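To see that effect in practice, here is a quick sketch (not in the original notebook) comparing the number of support vectors for a small and a large C on the training split defined above; a smaller C tolerates more margin violations, which typically leaves more support vectors:

In [ ]:
# effect of C on the number of support vectors (per class)
for c_value in [0.1, 100]:
    m = SVC(kernel='linear', C=c_value).fit(X_train, y_train)
    print("C = {}: support vectors per class = {}".format(c_value, m.n_support_))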

In [161]:
model.fit(X_train, y_train)
Out[161]:
SVC(C=100, kernel='linear')
In [162]:
print('Accuracy on training set: {:.2f}'.format(model.score(X_train, y_train)))

print('Accuracy on test set: {:.2f}'.format(model.score(X_test, y_test)))
Accuracy on training set: 0.79
Accuracy on test set: 0.79

We can tune the hyperparameter C to find the optimal value, using GridSearchCV:

In [163]:
from sklearn.model_selection import GridSearchCV
    
c = np.linspace(start = 0, stop = 300, num=50)  # candidate values for C (note: SVC requires C to be strictly positive, so C = 0 will fail to fit)
param_grid = {'C': c}


grid = GridSearchCV(model, param_grid =param_grid, cv=3, n_jobs=-1, scoring='accuracy')
grid.fit(X_train, y_train)
  
print("The best parameters C value is %s with a score of %0.0f" % (grid.best_params_, grid.best_score_ * 100 ))
print( "The best estimator accuracy on test set = {:.2f} ".format(grid.best_estimator_.score(X_test, y_test) * 100 ) )
The best parameters C value is {'C': 6.122448979591836} with a score of 80
The best estimator accuracy on test set = 78.95 

Since we used visualizations at the beginning of the notebook, let's plot the decision boundaries of the classes that the SVM differentiated.

As a reminder, we are looking at these two dimensions:

In [164]:
# plot the samples
plt.scatter(X_train.iloc[:,0], X_train.iloc[:,1], c=y_train, cmap=plt.cm.Paired, edgecolors='k');

The code below is inspired by the Sklearn repo:

In [165]:
# Create meshgrid to plot
x_min, x_max = X_train.iloc[:,0].min() - 1, X_train.iloc[:,0].max() + 1
y_min, y_max = X_train.iloc[:,1].min() - 1, X_train.iloc[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

# predict a class for each point of the meshgrid
Z = grid.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)


# plot the samples
plt.scatter(X_train.iloc[:,0], X_train.iloc[:,1], c=y_train, cmap=plt.cm.Paired, edgecolors='k')
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.5);

As mentioned earlier, the kernel used was a linear kernel, which can be observed in the decision boundaries above.

Let's quickly re-run the code above, this time using a polynomial kernel of degree 2.

In [167]:
model = SVC(kernel='poly', degree=2)
model.fit(X_train, y_train)
# Tune Hyperparameter    
c = np.linspace(start = 0, stop = 300, num=50)
param_grid = {'C': c}


grid = GridSearchCV(model, param_grid =param_grid, cv=3, n_jobs=-1, scoring='accuracy')
grid.fit(X_train, y_train)
  
print("The best parameters C value is %s with a score of %0.0f" % (grid.best_params_, grid.best_score_ * 100 ))
print( "The best estimator accuracy on test set = {:.2f} ".format(grid.best_estimator_.score(X_test, y_test) * 100 ) )

# Create meshgrid to plot
x_min, x_max = X_train.iloc[:,0].min() - 1, X_train.iloc[:,0].max() + 1
y_min, y_max = X_train.iloc[:,1].min() - 1, X_train.iloc[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

# predict a class for each point of the meshgrid
Z = grid.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)


# plot the samples
plt.scatter(X_train.iloc[:,0], X_train.iloc[:,1], c=y_train, cmap=plt.cm.Paired, edgecolors='k')
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.5);
The best parameters C value is {'C': 244.89795918367346} with a score of 79
The best estimator accuracy on test set = 84.21 

Side note: Problems with the resulting decision regions.

As often with this kind of algorithm, the extrapolation made for each class region can be quite bothersome. If you look at the plot above, it seems quite odd that the algorithm predicts a class for the whole upper-left part of the graph, where no data points are present. This is just a side note on such algorithms; alternatives such as Gaussian mixture models can be quite useful here, since they would allow you to assess uncertainty and define a confidence threshold below which no prediction is made (up to you ;) )
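For completeness, here is a rough, hypothetical sketch of that idea (not in the original notebook): a Gaussian mixture fitted on the same two features returns component probabilities, so you could refuse to predict whenever the model is too uncertain. Note that the mixture is fitted unsupervised, so its components would still need to be mapped to the actual classes.

In [ ]:
from sklearn.mixture import GaussianMixture

# fit a 3-component mixture on the two training features
gmm = GaussianMixture(n_components=3, random_state=42).fit(X_train)
proba = gmm.predict_proba(X_train)

# keep only predictions where the most likely component has probability > 0.8
# (the 0.8 threshold is arbitrary, for illustration)
confident = proba.max(axis=1) > 0.8
print("Confident predictions: {} out of {}".format(confident.sum(), len(X_train)))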

Final words

That was it for Support Vector Machines. Hope you enjoyed it, and drop me a mail if you have any comments, questions or (constructive) criticism.